For my project, I decided to explore the “Red Wine Quality” dataset to understand which factors affect the quality of red wine.
Volatile acidity:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Density:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
pH:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Alcohol:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Quality:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
I want to create a new factor called “rating” that separates the wines into “good” (quality = 7 or 8), “average” (quality = 5 or 6) and “bad” (quality = 3 or 4).
## # A tibble: 3 x 2
## rating n
## <chr> <int>
## 1 average 1319
## 2 bad 63
## 3 good 217
After transforming quality into rating, it is clearer that the vast majority of the wines are “average” (1319 out of 1599 entries, about 82%). This can be problematic for this exercise, as it leaves few data points for the other categories: 217 “good” and 63 “bad”.
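The binning described above could be sketched in base R roughly as follows. This is only an illustration on a toy `quality` vector (not the real data), and `cut()` is one of several ways to do it; the report's tibble output suggests a dplyr approach may have been used instead.

```r
# Toy quality scores for illustration only; not the real dataset.
wine <- data.frame(quality = c(3, 4, 5, 5, 6, 6, 6, 7, 8))

# Bin quality into three labelled groups: 3-4 bad, 5-6 average, 7-8 good.
# cut() uses right-closed intervals by default: (2,4], (4,6], (6,8].
wine$rating <- cut(
  wine$quality,
  breaks = c(2, 4, 6, 8),
  labels = c("bad", "average", "good")
)

# Count wines per rating and compute the share of each group.
counts <- table(wine$rating)
shares <- round(prop.table(counts), 3)
print(counts)
print(shares)
```

Printing the shares next to the counts makes an imbalance like the one above easy to spot.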
The data has 1599 observations of 11 input variables: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol, plus the quality score. The wines are rated on a scale from 0 (worst) to 10 (best), although the scores in this data only range from 3 to 8.
I’m interested in understanding how quality is affected by the other factors.
For now, I think that most of the other features (all except X, which appears to be just a row index) will be helpful for the investigation.
Yes, I created “rating” from “quality”.
Most of the variables are approximately normally distributed, with the exception of Free Sulphur Dioxide, Total Sulphur Dioxide and Sulphates, which are positively skewed, and Citric Acid, which has a lot of zeros. I did not perform any operations to clean the data, since it is already clean.
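A quick numeric check can back up what the histograms suggest. The sketch below uses made-up vectors (the values and the `skewness` helper are illustrative assumptions, not part of the original analysis) to show two simple signals: a mean well above the median, and a positive sample skewness.

```r
# Hypothetical values for illustration only; not taken from the dataset.
citric_acid <- c(0, 0, 0, 0.1, 0.25, 0.3, 0.5)
sulphates   <- c(0.4, 0.5, 0.55, 0.6, 0.65, 0.7, 2.0)

# A mean well above the median hints at a long right tail (positive skew).
mean_minus_median <- mean(sulphates) - median(sulphates)

# Sample skewness: third central moment divided by the cubed
# (population) standard deviation.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Count exact zeros, as observed for citric acid.
n_zero <- sum(citric_acid == 0)

print(mean_minus_median)
print(skewness(sulphates))
print(n_zero)
```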
Looking at the correlation matrix above, I can see that quality has the strongest correlation with Alcohol, followed by Volatile Acidity (negative), Sulphates, and Citric Acid. I’d be interested in diving further into the paired variables later on. Fixed Acidity has a high correlation with Citric Acid, which is not surprising, since both are measures of acidity. Similarly, Citric Acid is highly (negatively) correlated with Volatile Acidity. Total Sulphur Dioxide is highly correlated with Free Sulphur Dioxide; we’ve seen in the histograms above that they follow similar distributions.
There seems to be a positive relationship between alcohol content and quality: higher quality wines tend to have higher alcohol content, while lower quality wines tend to have lower alcohol content. I also noted that there’s a large overlap in the range of alcohol content across the different quality levels, which reaffirms that the relationship isn’t extremely strong.
There seems to be a negative relationship between volatile acidity and quality: lower quality wines tend to have higher volatile acidity, while higher quality wines tend to have lower volatile acidity. To better understand this relationship, I looked up what Volatile Acidity is, and I found that it is “the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste”. This is consistent with the relationship I found in my analysis.
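One way to put numbers on both the trend and the overlap is to summarise alcohol within each quality score. The rows below are hypothetical and only illustrate the technique; `tapply()` is used as a base R stand-in for whatever grouping the original plots relied on.

```r
# Hypothetical rows for illustration only; not the real dataset.
wine <- data.frame(
  quality = c(4, 4, 5, 5, 6, 6, 7, 7),
  alcohol = c(9.2, 10.5, 9.4, 10.0, 10.2, 11.0, 11.5, 12.3)
)

# Median alcohol for each quality score.
median_by_quality <- tapply(wine$alcohol, wine$quality, median)

# Min and max per score, to see how much the groups overlap.
range_by_quality <- tapply(wine$alcohol, wine$quality, range)

print(median_by_quality)
print(range_by_quality)
```

Rising medians together with overlapping ranges would match the pattern described above: a real but not very strong relationship.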
The graph shows a negative relationship between pH and Fixed Acidity. Fixed Acidity describes how much acid is in the wine, while pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic). So this relationship makes sense: higher fixed acidity means lower pH, and vice versa.
There’s a strong positive relationship between density and fixed acidity, which I found interesting.
The first thing that jumped out at me was that there wasn’t a very strong relationship between quality and any single factor. Quality is most correlated with alcohol content, where higher quality red wines tend to have a higher alcohol percentage. Quality is negatively correlated with volatile acidity.
I thought most of the relationships were fairly intuitive; strong correlations were found between similar chemicals (e.g. Fixed Acidity vs. Citric Acid, Total Sulphur Dioxide vs. Free Sulphur Dioxide; I believe that these are just derivatives of one another). The most interesting one I found was the positive relationship between density and fixed acidity, which took me some research to figure out a potential reason for.
Of all the factors, quality is most strongly correlated with Alcohol, at 0.48. The strongest relationship among the other variables is Fixed Acidity vs. pH, a negative correlation with a magnitude of 0.68.
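For reference, a correlation matrix like the one discussed here can be produced with `cor()`. The sketch below runs on a small made-up data frame (column names follow the dataset's style, but the values are hypothetical), so its numbers will not match the 0.48 and 0.68 quoted above.

```r
# Hypothetical measurements for illustration only; not the real dataset.
wine <- data.frame(
  fixed.acidity = c(6.0, 7.0, 7.8, 8.5, 9.9, 11.2),
  pH            = c(3.60, 3.51, 3.40, 3.30, 3.20, 3.16),
  alcohol       = c(9.4, 9.8, 10.5, 9.5, 11.0, 12.0),
  quality       = c(5, 5, 6, 5, 6, 7)
)

# Pearson correlation between every pair of columns, rounded for reading.
corr <- round(cor(wine), 2)
print(corr)

# Correlations of the other columns with quality, strongest magnitude first.
with_quality <- corr["quality", colnames(corr) != "quality"]
print(with_quality[order(abs(with_quality), decreasing = TRUE)])
```

Sorting by absolute value keeps strong negative correlations, such as volatile acidity in the real data, from being overlooked.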
Looking at the graph, most of the higher quality wines tend to be in the bottom right corner, while the lower quality wines tend to be in the top left, with the medium quality wines in between. The relationship isn’t that easy to see on the graph, so I decided to redo the graph with rating instead.
As I expected, the relationship shows more clearly on this graph.
We can see that rating tends to be better at higher sulphates levels and higher citric acid levels.
There’s a strong positive relationship between density and fixed acidity: wines with higher fixed acidity have higher density. Water has a density of 1 g/cm³, and most of the wines here have a density below 1, with some above 1. This is probably because of the alcohol content, since alcohol has a density below 1 (consistent with the negative correlation between density and alcohol). Acid, on the other hand, has a density above 1, which, at the extreme, could counter the effect of alcohol on density and result in some wines having a density above 1.
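The alcohol part of this argument can be illustrated with a back-of-the-envelope calculation. The sketch assumes ethanol has a density of roughly 0.789 g/cm³, treats water as exactly 1, and assumes volumes add ideally (real mixtures contract slightly); `mix_density` is a hypothetical helper, and dissolved acids and sugars are deliberately left out.

```r
# Rough densities in g/cm^3; water is taken as 1.0 as in the text above.
density_water   <- 1.0
density_ethanol <- 0.789  # approximate value near room temperature

# Simplifying assumption: volumes add ideally, so the mixture density is
# the volume-weighted average of the two components.
mix_density <- function(alcohol_fraction) {
  alcohol_fraction * density_ethanol + (1 - alcohol_fraction) * density_water
}

# Two example alcohol levels by volume (8.4% and 14.9%).
low_alcohol  <- mix_density(0.084)
high_alcohol <- mix_density(0.149)
print(c(low_alcohol, high_alcohol))
```

Both results fall below 1, and the stronger mixture is the lighter one. Under these assumptions, the densities near or above 1 seen in the data would have to come from dissolved components such as acids.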
In this graph, we look at quality versus the two factors with the highest correlation, alcohol and volatile acidity. It is clear from the graph that lower quality wines tend to be in the upper left quadrant, represented by the orange dots, while higher quality wines tend to be in the lower right quadrant, represented by the purple dots. The average quality wines tend to be in the lower left quadrant, represented by the green dots.
In this graph, we look at quality (as represented by rating) versus Sulphates and Citric Acid. Both factors have a positive relationship with wine quality: the bad wines in orange tend to be in the lower left corner of the graph, and the good wines in purple tend to be in the upper right part of the graph.
For this exercise, I explored the red wine quality dataset to look into factors that help determine the quality of red wine. I looked at the variables individually to see their distributions. Out of the 1,599 data points, 1,319 are average wines, leaving very few good or bad wines. I would love to get more data on good and bad wines to draw more statistically reliable conclusions about the factors that affect quality. I was very surprised that, of all the factors, alcohol content had the highest correlation with wine quality. I wonder if this is because of the lack of data points or the lack of better factors to analyze, since I definitely would not judge a wine by its alcohol content myself. As a consumer of wine, I’d also be interested in knowing the price of each wine to see how it correlates with quality.
To further analyse the data, I could build a model that predicts quality from all the variables.
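As a starting point, such a model might be an ordinary linear regression. The sketch below is only one possible approach, fitted on a handful of hypothetical rows with three of the predictors; it treats quality as numeric, which is a simplification given that the scores are ordered integer ratings.

```r
# Hypothetical rows for illustration only; replace with the real data frame.
wine <- data.frame(
  quality          = c(5, 5, 6, 5, 6, 7, 4, 6, 7, 5),
  alcohol          = c(9.4, 9.8, 10.5, 9.5, 11.0, 12.0, 9.0, 10.8, 12.5, 9.9),
  volatile.acidity = c(0.70, 0.88, 0.40, 0.66, 0.35, 0.30, 0.95, 0.50, 0.28, 0.60),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.75, 0.80, 0.47, 0.66, 0.82, 0.52)
)

# Ordinary least squares; `quality ~ .` uses every other column as a predictor.
fit <- lm(quality ~ ., data = wine)

# Estimated coefficients and the share of variance explained.
print(coef(fit))
print(summary(fit)$r.squared)
```

With the full dataset, an index column such as X would need to be dropped before using the `quality ~ .` shorthand.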